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Abstract 

Background: Identifying genetic variants associated witli complex human diseases is a great challenge in 
genome-wide association studies (GWAS). Single nucleotide polymorphisms (SNPs) arising from genetic background 
are often dependent. The existing methods, i.e., local index of significance (LIS) and pooled local index of significance 
(PUS), were both proposed for modeling SNP dependence and assumed that the whole chromosome follows a 
hidden Markov model (HMM). However, the fact that SNP data are often collected from separate heterogeneous 
regions of a single chromosome encourages different chromosomal regions to follow different HMMs. In this research, 
we developed a data-driven penalized criterion combined with a dynamic programming algorithm to find change 
points that divide the whole chromosome into more homogeneous regions. Furthermore, we extended PUS to 
analyze the dependent tests obtained from multiple chromosomes with different regions for GWAS. 

Results: The simulation results show that our new criterion can improve the performance of the model selection 
procedure and that our region-specific PUS (RSPLIS) method is better than PUS at detecting disease-associated SNPs 
when there are multiple change points along a chromosome. Our method has been used to analyze the Daly study, 
and compared with PUS, RSPLIS yielded results that more accurately detected disease-associated SNPs. 

Conclusions: The genomic rankings based on our method differ from the rankings based on PLIS. Specifically, for the 
detection of genetic variants with weak effect sizes, the RSPLIS method was able to rank them more efficiently and 
with greater power. 



Background 

At the present time, genome-wide association studies 
(GWAS) have become a very popular tool for identi- 
fying novel genetic variants for complex traits. GWAS 
typically tests hundreds of thousands of markers simul- 
taneously, making it necessary to improve the power of 
large-scale multiple testing. Fortunately, the false discov- 
ery rate (FDR) for controlling such procedures, which 
was introduced in a seminal paper [1], is one of the 
most important methodological developments in mul- 
tiple hypothesis testing and has played successful role 
in many large-scale multiple testing studies. Such stud- 
ies include multi-stage clinical trials, microarray experi- 
ments, brain imaging studies, and astronomical surveys. 
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amongst others [2-10]. Recently, the FDR approach has 
also been applied to GWAS [11]. Naturally, despite the 
increasing popularity of FDR, most of the traditional ana- 
lytical methods for GWAS with FDR control have largely 
been proposed for individual single nucleotide polymor- 
phism (SNP) analysis. However, because SNPs on the 
same chromosome are in local linkage disequilibrium 
(LD), which results in the complex dependence and cor- 
relation among large-scale tests, the traditional FDR con- 
trolling procedure for independent tests can potentially 
be conservative and lead to a loss of power. Therefore, it 
is important to consider FDR control for multiple testing 
procedures when the tests are dependent in GWAS. 

Fortunately, Wei and Li pointed out that genomic 
dependency information could significantly improve the 
efficiency of analysis of large-scale genomic data [12,13]. 
We also expect that information about SNP dependency 
can be exploited to construct tests that are more efficient. 
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From a biological point of view, SNP dependency is infor- 
mative when constructing more efficient association tests, 
because when a SNP is associated with a disease, it is likely 
that the neighboring SNPs are also disease-associated 
(owing to the co- segregation). Therefore, when decid- 
ing the significance level of a SNP, its neighboring SNPs 
should be taken into account. Sun and Cai [14] proposed 
a local index of significance (LIS) controlling procedure 
that uses a hidden Markov model (HMM) to represent 
the dependence structure, and has shown its optimality 
under certain conditions and its strong empirical perfor- 
mance. Furthermore, this LIS procedure was extended 
to pooled local index of significance (PLIS) procedure 
for multiple-chromosome analysis [15], where the authors 
developed chromosome-specific HMMs for analysis of 
the SNP data arising from large-scale GWAS. Instead of 
HMM, Li [16] introduced a hidden Markov random field 
model (HMRFM) to account for LD when analyzing the 
SNP data from GWAS. 

Therefore, the above methods are all based on the strong 
assumption that each chromosome follows a HMM or 
HMRFM. However, there are usually various LD patterns 
or haplotype blocks in the data, which result in hetero- 
genity of dependencies among SNPs and variations in 
the disease risk rates of casual alleles in the different 
blocks. Hence, we suggest that the different blocks of 
each chromosome should follow different HMMs. Wei 
et al. [15] have stated that the development of a multiple 
testing procedure essentially involves two steps: ranking 
the hypotheses and choosing a cutoff along the rank- 
ings, where the ranking step is more fundamental. Obvi- 
ously, modeling different regions by different HMMs can 
improve the efficiency of ranking. To this end, we should 
first identify change points for each chromosome, which 
can be used to divide the whole chromosome into smaller 
homogeneous regions. Specifically, we need to determine 
the number of change points as well as their locations on 
chromosomes. 

In addition, the existing methods [14,15] assume that 
the observation variables follow a normal mixture dis- 
tribution conditional on the latent variables in HMM, 
where the number of components is unknown in the 
normal mixture distribution. Sun and Cai [14] showed 
that the number of components should be determined 
by the likelihood-based Bayesian information criterion 
(BIG). However, BIG, as well as many other existing cri- 
terions, rely on a strong assumption that the observations 
are independent and require large sample sizes to reach 
their asymptotic consistency behavior. While in HMM, 
the observations are dependent such that the effective 
sample size for these criterions may be small. 

In this paper, we first focus on the problem of how to 
infer simultaneously the number of components in a nor- 
mal mixture distribution as well as the change points of 



each chromosome. We put forward a data-driven penal- 
ized criterion for model selection in HMM, and propose 
a sliding window-based improved version of the dimen- 
sion jump method [17] to estimate this criterion. We 
then applied the dynamic programming (DP) algorithm to 
find multiple change points. The numerical results show 
that the proposed adaptive criterion has better perfor- 
mance than the original version. Second, we extended the 
approach of Wei et al. [15] to develop a testing proce- 
dure, which we have called region-specific PLIS (RSPLIS), 
for the analysis of different chromosomes with multiple 
regions. The numerical results show that RSPLIS outper- 
forms PLIS in a disease association study. Our proposed 
procedure has been used to analyze the data from Daly 
et al. [18] for identifying Grohn's disease-associated SNPs. 

Methods 

First, Sun and Gai [14] developed a compound decision- 
theoretic framework for testing HMM-dependent 
hypotheses and presented an optimal testing procedure 
that can be used to analyze a single chromosome for SNP 
data. Second, Wei et al. [15] proposed a PLIS approach 
for multiple-chromosome analysis. They showed that 
under some regularity conditions, the PLIS procedure is 
valid and asymptotically optimal in the sense that it can 
control the global chromosome-wise FDR at the nomi- 
nal level a and has the smallest false non-discovery rate 
(FNR) among all valid FDR procedures by combining 
the testing results from different chromosomes. In this 
section, we extend the chromosome-specific PLIS to the 
analysis of different chromosomes with multiple regions. 
In what following, we formulate our statistical model and 
elaborate the theoretical foundations of our method. 

Region-specific multi-IHMM with change points and where 
the number of components known 

In this subsection, we assume that we have known change 
points as well as the number of components in the 
normal mixture distribution. Let Wc and nic denote the 
change point set and the number of components for 
chromosome c. Suppose the case-control genotype data 
are available from the Lcr SNPs in the r-th region of 
chromosome c, c = 1, . . . , C, r = 1, ...,i?c- We let 
O^f^ = 1 indicate that SNP / from region r of chro- 

(c) 

mosome c is disease-associated and Oy = 0 otherwise. 
For each SNP, we first obtain a p-velue by conducting a 
X'^-test to assess the association between the allele fre- 
quencies and the disease status, then we convert the 
p-Yslue into a 2:-value Z^f using the transformations pro- 
posed in Wei et al. [15] for further analysis. For chro- 
mosome c, let 6'^'^ = {Ol,f;l=h...,Lcnr=h...,Rc} 
and Z^'^ = {Z^f; 1=1,.. .,Lcn r = 1, . . . ,i?c}. In the fol- 
lowing, we treat 0^^"^ as the hidden variables and Z^^^ as 
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the observed variables to consider HMMs using some 
assumptions. 

First, we assume that the observed data are conditionally 
independent given the hidden states for the same region, 
and that different regions of the same chromosome are 
independent. Then we have 

r=l 1=1 

Furthermore, let Z^f \0^f , Wc. mc ~ (1 - 0^f)Fl'^ + 
^rf^ri> ^here Tj:'^ = {F^o^F^f} denotes the observation 
distribution for each SNP in region r of the chromosome c. 
For a non-associated SNP, we assume that the z-value 

(c) 

distribution is a standard normal F^q = A/'(0, 1) for all 
regions of chromosome c, and for a disease-associated 
SNP, the value distribution is a normal mixture, whereby 

f*f = E ^i^(l^ri'<^ri )' = 1. and we assume that 

i=l i 

the number of components in the normal mixture is iden- 
tical for all regions of chromosome c. The normal mixture 
model can approximate a large collection of distributions 
and has been widely used elsewhere [19-21]. 

(c^ (c) 

Second, we assume that the hidden states Or and 0^ ^ 
are independent for the different regions, r and t For 
the r-th region of chromosome c, we assume that Or^^ = 

(c) (c) 

{0^i\ . . . , O^j^ } is distributed as a stationary Markov chain 

(c) (c) (c) 

with a transition probability <3^^^y = P(0^^j_^-^^ = J\Oy = i). 
Let us denote A^r = (^ny) transition matrix and 
Ttcr = (7tj,Q ,7Tj,fy the Stationary distribution, where 

_ M // (C) , (C) . 1 (C) _ . _ (C) 

Let (per = {(Mrf ' ^rf> ?^)' i = h..-> nic}, then we denote 
= (Acr, Ttcr, ^cr) the collection of HMM param- 
eters for r-th region of chromosome c. When Wc and 
rric are known, the maximum likelihood estimate of the 
HMM parameters can be obtained using the expectation- 
maximization (EM) algorithm [14,22]. 

Adaptive criterion-based partitioning (ACP) method for 
finding change points and the number of components 

However, in practice, change points and the number of 
components in the normal mixture distribution are often 
unknown. In this subsection, for each chromosome, we 
will give an ACP method to conduct a model selection 
procedure for simultaneously finding Wc and m^. 

Candidate change point set 

To effectively reduce the huge space of competing change 
points and save computation time, our ACP method needs 
a candidate change point set in advance. Here, we use 
a haplotype block partition method [23] to obtain the 
haplotype-block boundary points for each chromosome. 



which can be collected as the candidate change point 
set. Because the minimum length value of block Lynin 
should be pre-specified in their haplotype block partition 
method, here, we let Lmin be 300 for all our analysis. 

Adaptive criterion-based partitioning procedure 

Simultaneously inferring Mc and Wc can be regarded as a 
model selection problem. To select a desired model, the 
commonly used methods are established base on the cri- 
terion of minimizing the penalized negative maximum 
likelihood (e.g. BIC). However, many other existing cri- 
terions including BIC, assume that the observations are 
independent, which is not true in HMM. As a result, the 
effective sample sizes may be small owing to strong depen- 
dence among the observations, and the existing criterions 
may suffer from a failure of consistency. A data-driven 
penalized criterion was proposed in the Gaussian and 
least-squares regression model selection for independent 
observations [17,24,25]. Especially, [25] used this adaptive 
criterion for variable selection and clustering in Gaussian 
mixtures model and showed that this adaptive criterion 
outperforms other criterions (e.g. BIC) for small sam- 
ple sizes. Following their work, we propose a data-driven 
penalized criterion for dependent observations in HMM. 

Let Wc C denote a change point set for chromo- 
some c, \Wc\ be the number of the change points in Wc, 
and ^c = {^cn r = 1, . . . ,Rc}> where is the candidate 
change point set for chromosome c. Then, we consider a 
penalized maximum likelihood criterion with the follow- 
ing form 

critx, (mc, Wc) = -In (^P(Z^^^ l^'c, rric, Wc)^ + pen^^ (Wc, rric) 
= -In ^J2^^^^'^'^^'^\'^c,mc,Wc)P(e^^^ 

+ ^cD{mc,Wc)i 

(1) 

where is the maximum likelihood estimator of the 
parameters in HMMs for chromosome c using an 
EM algorithm, and the penalty function pen^^ {Wc, ffic) = 
K^imcWc) is designed to avoid overfit problems. In this 
penalty function, Xc > 0 is a tuning parameter to be cho- 

Rc 

sen depending on sample size Lc = ^cr and D(^mc,wc) is 

r=l 

the number of parameters in the model. Furthermore, in 
this paper, we have Di^j^^^^^) = (\wc\ + l)Dm,, where is 
the number of parameters that only depend on m^. If we 
let Xc = ^^^y^, this penalty function becomes the penalty 
function of BIC for HMM. 

Given a value of Xc in the penalized criterion 
pen^^ (tVc, Mc) = Aci)(mc,M/c)> fin^ and rkc to min- 

imize crit;^^ (rricy Wc) of equation 1 by running Algorithm 1. 
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Algorithm 1 



Algorithm 2 



Step 1. Input Kmax and ntmax 

Step 2. Given the number of components ntc = *, 
run the dynamic program (DP) algorithm to 
realize the following optimal problem, 
Wc^^=i < — arg min critx^(wc = i, Wc), 

then we can obtain a temporary change point set 
Wc,m,=i, where m,„ax' 
Step 3. Among all the temporary change point sets 



] obtained in Step 2, 



choose the cliange point set w'c,Wc=Wc ^^^^ 

number of components m^, 

which minimizes cnt},^(mc, ^cmc)- Then let 



In step 1, we need to give pre-specified values of K^ax 
and nifnax in advance, where K^ax denotes the maximum 
value of the number of change points for each chromo- 
some, and nimax is the maximum value of the number 
of component. As we know, the number of true change 
points is usually far less than the number of the candi- 
date change points in practical applications, so we can give 
a smaller value for K^ax to save computation time. For 
^maxf Wei et al. [15] suggested that values of between four 
and six are usually chosen. 

In step 2, following the methods of [26] and [27], we 
provide an optimal partitioning search method for change 
points to estimate Wc,i given Xc and nicy which is, in 
essence, a dynamic program (DP) algorithm. The detailed 
procedure about the optimal partitioning search method 
is shown in Additional file 1. 

However, in practice, Xc is unknown and needs to 
be calibrated and estimated from the data themselves. 
Slope heuristics [24] as well as its generalization, dimen- 
sion jump method [17], are practical and effective cal- 
ibration algorithms to estimate the optimal penalty 
pen^^^^^(wc, mc) = KoptD{mc,wc)' Here, we propose a slid- 
ing window-based dimension jump method to estimate 
"^copty where the sliding window is used to avoid los- 
ing cases involving several successive jumps. When the 
width of the sliding window is 1, our proposed method 
becomes the dimension jump method of Wei et al. [17]. 
The following algorithm describes the detailed procedure 
for estimating Xc^opt- 



Step 1. Given a grid for Xc* 0 < Xc,i < A.c,2 < . . . < A.c,t 
< . . . < Xc,r = In(Ic), 

where A.c,t = ^-^^^^ t = 1, . . ., T, and we set 

r = 50 in this paper. 
Step 2. For each X^u i^un Algorithm 1 to minimize 

critx^,, {ntct iVc) of equation (1), 

and we can obtain {rhc^, Wc,t)* 
Step 3. Given the width of sliding window ft > 1, 

firstly, we let set 

and f,„rf^ mini', 
then let set 

Q2 = {S*;S* € {tend- h,..., tend -l], 

and tinu < — max s, 

2 



Step4.Xc/nin ' 
Step 5. Xc,opt ^^c^in 



At the end of Algorithm 2, we can obtain the estima- 
tion Xc,opt of the Xc,opt' Having Xc^opty we can then run 
Algorithm 1 to obtain rkcopt and Wc,opt as well as the 
desired optimal model by minimizing mti^(mc,Wc) of 
Equation (1). At the same time, we can get the esti- 
mates of model parameters based on the optimal 
model, where c = 1, . . . , C. 

Pooled FDR control procedure for different chromosomes 
with multiple regions 

After each chromosome is divided into different regions 
by change points, it is desirable that that the global region- 
wide FDR can also be controlled by combining the test 
results from multiple regions of different chromosomes. 
In the following, we extend the chromosome-specific PLIS 
to the RSPLIS and operate the new procedure in three 
steps: 

Step 1. For chromosome c (c = 1, . . . , C), we search the 
change points to divide the whole chromosome 
into multiple regions using the ACP method. 
For each region r, we can get by using the 
EM algorithm from which we can calculate the 
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plug-in LIS statistic LIS^^^ = P, 
for all regions of each chromosome by using the 
forward-backward algorithm [28] . 
Combine and rank the plug-in LIS statistics 
from different regions of multiple chromosomes. 
Denote by LIS(i), . . . , LIS(l) the ordered values 
and //(I), . . the corresponding 

c R 

hypotheses, where L = J2 J2 ^cr- 

c=l r=l 

Reject all / = 1, . . . , /, where 

i 

I — max{i : J2 LIS(/) < a], 

7=1 



Step 2. 



Step 3. 



We define FNR as the expected proportion of falsely 
accepted hypotheses. Under a compound decision- 
theoretic framework, the following theorem can verify 
that our RSPLIS is valid and asymptotically optimal. We 
provide the detailed proof of the theorem in Additional 
file 1. 

Theorem 1. Consider the multi-region HMMs defined 
in section 'Region-specificmulti-HMMwith change points 
and where the number of components known'. Let 
LIS(i)y . . . yLIS(L) be the ranked LIS values from all the 
regions of all chromosomes. Then, the RSPLIS procedure 
controls the global FDR at level a. In addition, the global 
FNR level of RSPLIS is ^* + o(l), where ^* is the smallest 
FNR level among all valid FDR procedures at level a. 



Results 

Simulation study 

In this section, we design the detailed simulation studies 
to illustrate the performance of our ACP method in model 
selection; thereafter we conducted simulation studies to 
compare the performance of the proposed RSPLIS with 



that of PLIS in G WAS. All the simulations that follow were 
replicated 100 times. 

Simulations of the ACP method performance for model 
selection 

Simulations in this subsection were conducted to compare 
the performance of our ACP method with that of BIC- 
based partitioning (BICP) method for selecting change 
points and the number of components. For simplicity, 
we consider a single chromosome that has five station- 
ary regions. We assume each region has the same length 
Lq and set Iq equal to 600, 900 and 1200. The detailed 
simulation parameter settings are given in Table 1. With 
different parameter settings, we expect that the first two 
change points can be identified easily, while the last two 
change points are harder to be identified. 

In this simulation, we give the candidate change point 
set 



{300/; / = 1, 2, . . . 



5Io - 300 
300 



which ensures that the true change point set 
{iL^; / = 1, . . . , 4} C w^. To compare the performance 
of the two methods, we used sensitivity and specificity 
as measures. Sensitivity is defined as the average pro- 
portions of the true change points which are correctly 
identified as change points over the 100 times and the 
specificity is defined as the average proportions of the 
false change points which are not identified as the change 
points over the 100 times. We set Kmax = 8 and m^ax = 5 
for our ACP method. At the same time, h takes the values 
of 2, and 20 for the window. 

The simulation results are summarized in Tables 2 
and 3. From these tables, we can see that BICP misses 
most of the true change points and the true number of 
components. Moreover, we can also see our ACP has 



Table 1 Parameter settings of simulated data 



Region 



Transition matrix 



0.98 0.02 

0.20 0.80 

0.98 0.02 

0.95 0.05 

0.98 0.02 

0.45 0.55 

0.98 0.02 

0.15 0.85 

0.98 0.02 

0.15 0.85 



Stationary distribution 



(0.91,0.09) 



(0.98, 0.02) 



(0.96, 0.04) 



(0.88, 0.12) 



(0.88, 0.12) 



z-value distribution 
(Null) 

A/(0,1) 
A/(0,1) 
A/(OJ) 



A/(0,1) 



A/(0,1) 



z-value distribution 
(No-null) 

0.1A/(1.0,1) + 0.9A/(3.0,1) 
0.8A/(1.5,1) + 0.2A/(4.5,1) 
0.4A/(1.5,1) + 0.6A/(3.5,1) 
0.2A/(1.0,1) + 0.8A/(3.0,1) 
0.2A/(1.0,1) + 0.8A/(3.0,1) 
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Table 2 The results of comparing ACP method with BICP 
method for selecting the true number of components 

m = 2 



Length of region 


Measures 




ACP 




BICP 


iL) 




h = 2 


h = 10 


h = 20 




600 


Sensitivity 


0.86 


0.83 


0.21 


0.18 




Specif ity 


0.97 


0.96 


0.80 


0.80 


900 


Sensitivity 


0.83 


0.80 


0.13 


0.16 




Specifity 


0.96 


0.96 


0.78 


0.79 


1200 


Sensitivity 


0.81 


0.78 


0.12 


0.15 




Specifity 


0.95 


0.94 


0.78 


0.79 



higher specificity than BICP. The reason for the poorer 
behavior of BICP may be related to the lack of indepen- 
dent observations in this experiment, so there may be a 
smaller effective sample size for BIC. In addition, based on 
the simulation results, we can see that the performance of 
our ACP is very good for h = 2 and h = 10, but is very 
poor for h = 20, so we suggest h should not be more than 
10 in practice. 

HMM-based simulations for comparing RSPLIS with PUS in 
GWAS 

For simplicity, in this simulation, suppose there are two 
chromosomes (c = 2) in total, each of which consists of 
two stationary regions, and each region has 2000 SNPs 
{Lcr = 2000, r = l,2,c = 1,2). For each chromo- 
some, we set Kmax = 4, m^ax = 3, and /z = 2 for 
our ACP method and gave the candidate change point 
set M/^ = {300/; / = 1, 2, . . . , ^^^^^^}, c = 1, 2. The pur- 
pose of this simulation is to compare RSPLIS with PLIS 
by finding disease-associated SNPs while controlling the 
FDR at a pre-specified level a = O.l for the two chromo- 
somes (combining chromosomes 1 and 2). We conducted 
simulation studies in the following two cases. 

Case 1. In this case, we varied the dependence param- 
eters in transition matrices of HMM and kept the other 



Table 3 The results of comparing ACP method with BICP 
method for selecting the true change point set 
{/L;/ = 1,...,4} 



Length of region 


Measures 




ACP 




BICP 


iL) 




h = 2 


h = 10 


h = 20 




600 


Sensitivity 


0.91 


0.90 


0.27 


0.22 




Specifity 


0.90 


0.85 


0.74 


0.73 


900 


Sensitivity 


0.68 


0.67 


0.13 


0.13 




Specifity 


0.86 


0.85 


0.81 


0.83 


1200 


Sensitivity 


0.42 


0.29 


0.08 


0.08 




Specifity 


0.89 


0.82 


0.86 


0.87 



parameters fixed, and then we investigated the behavior 
of RSPLIS and compare it with PLIS procedure to identify 
casual SNPs at the different disease risk levels. We used 
the parameter settings in Table 4, where we varied the 
degree of dependence among SNPs in region 2 by chang- 
ing the value of vi (ui = 0, 0.15, 0.30, 0.45). Furthermore, 
we let IJL21 = l-iil + 1.5, jjiyl — + varied 
the disease risk parameter /x^^^ from 0.5 to 2.0 with an 
increment of 0.5. 

Case 2 In contrast to Case 1, to assess the performance 
of RSPLIS at the different disease risk levels, we varied 
the parameters of the :2-value distribution while fixing 
the other parameters. We used the parameter settings in 
Table 5, where we varied the parameters of the z-value dis- 
tribution by changing the value of V2 (1^2 = 0.5, 1, 1.5, 2). 
Furthermore, we let 1121 = + ~ /^n +0-^ 

and varied the disease risk parameter from 0.5 to 2.0 
with an increment of 0.5. 

The simulation results are shown in Figures 1, 2, 3 
and 4. From Figure 1 and Figure 3, we can obviously see 
that both RSPLIS and PLIS are well controlled at FDR 
level 0.1 asymptotically. Figure 2 and Figure 4 inform 
us that PLIS is dominated by RSPLIS for the power at 
Case 1 and Case 2, which indicates that our RSPLIS 
procedure is effective at dividing the chromosomes into 
smaller and more homogeneous regions by searching the 
change points. In addition, the difference in FNR levels 
(RSPLIS vs. PLIS) becomes smaller as jjiyl increases for 
each model, which implies that RSPLIS is especially useful 
when the disease signals are weak. 

To show that the higher power of RSPLIS is not gained 
to the detriment of a higher FDR level, we conducted a 
further simulation study. This study evaluated the sensi- 
tivities at different FDR levels for ui = 0,0.15,0.30,0.45 
in Case 1, and V2 — 0.5, 1.0, 1.5, 2.0 in Case 2, where the 
sensitivities were calculated as the average proportions of 
correctly identified SNPs over the 100 replications. For the 
purpose of illustration, we have only listed the results for 
v\ — 0.15 of Case 1 and — 1.0 of Case 2 in Figure 5 
and Figure 6 respectively, because the other results were 
broadly similar. It is clear from Figures 5 and 6 that RSPLIS 
discovers more true disease-associated SNPs than PLIS at 
the same FDR level. 

Genotype-based simulations for comparing RSPLIS with PLIS 

This simulation evaluated the performance of selecting 
the relevant SNPs for RSPLIS and PLIS based on the 
genotype data set. In contrast to the simulation study in 
Subsection 'Application to the Daly data set', we gener- 
ated case-control genotype data with more realistic LD 
patterns. To this end, we constructed a genotype pool 
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Table 4 Parameter settings of simulated data for Case 1 

Chromosome Region Transition matrix 

(c) (r) 



z-value distribution 
(Null) 



z-value distribution 
(Non-null) 



1 



1 



0.98 0.02 

0.03 0.97 

0.98 0.02 

0.03 + U] 0.97 - U] 



A/(0,1) 



A/(OJ) 



a/(m5V,i) 



A/(A6^V=/x5y + 1.5,1) 



0.98 0.02 

0.05 0.95 

0.98 0.02 

0.05 + ui 0.95 -ui 



A/(OJ) 



A/(OJ) 



A/(A6j^>=/x{V +0.5,1) 



A/(A6^^^=Mi^^+ 2.0,1) 



composed of genotypes from 60 samples for 23 chro- 
mosomes by randomly matching the pair of the known 
phased 120 haplotypes from the Illumina 550K. For sim- 
plicity, we only used SNPs selected from 2001-th SNP 
to 6000-th SNP of chromosome 7 and chromosome 15, 
respectively. Four SNPs were selected from each chromo- 
some as the disease causal SNPs, each with a relative risk 
of 1.5. Specifically, the four SNPs, 400-th, 900-th, 1750- 
th, 3200-th, were chosen to be far away on chromosome 
7, while the four other SNPs, 5600-th, 5604-th, 5608-th, 
5612-th, were chosen based on their proximity to each 
other (i.e., separated by 3 SNPs) on chromosome 15. 

For each subject, we first obtained the genotype, X, by 
drawing a genotype from the genotype pool at random. 
Using genotype X, we then simulated the disease status, Y", 
of this subject using the logistic regression model, 

exp(^o + E PiXi) 

P(Y =1\X) = , 

1 + exp(^o + E Pi^i) 

i=l 



where Po = —9.5425 for a disease rate of 0.03, Pi = 
log(1.5), for / = 1, . . . , 8. We repeated the sampling 
procedure until we obtained 1000 cases and 1000 con- 
trols are obtained. The eight disease casual SNPs were 
then removed from the simulated data set, and the 39 
SNPs that contained the three adjacent SNPs on each 
side of the eight disease-causal SNPs were regarded as 
the relevant SNPs. We assessed the performance of the 
testing procedure by the selection rate of relevant SNPs, 
where the percentages of the true positives (sensitivity) 
selected by the top M SNPs could be calculated easily. 
We set mynax = 6, /z = 2, and Kmax = 5 for the ACP 
method. 

We have plotted the average sensitivity curves for com- 
parisons of RSPLIS vs. PLIS in Figure 7. It is apparent that 
our RSPLIS dominates PLIS in ranking the revelant SNPs. 
In summary, these results show that exploiting the hetero- 
geneous chromosomal regions and searching the change 
points to find chromosomal regions that are more homo- 
geneous has improved the precision of RSPLIS in that 
the number of false positives has been reduced while the 
statistical power has increased. 



Table 5 Parameter settings of simulated data for Case 2 



Chromosome 

(c) 

1 



Region 

(r) 

1 



Transition matrix 



0.98 0.02 

0.025 0.975 

0.98 0.02 

0.025 0.975 

0.98 0.02 

0.025 0.975 

0.98 0.02 

0.025 0.975 



z-value distribution 
(Null) 

A/(OJ) 
A/(0,1) 

A/(0,1) 



A/(0,1) 



z-value distribution 
(Non-null) 



A/(At^?J) 



A/(A6i^>=MiV+ 0.5,1) 
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Q 0.1 




0.15 



Q 0.1 




Q 0.1 




Figure 1 Shows that the FDR levels of the two methods are controlled at the level 0.1 0 asymptotically for four different parameter 
settings in Case 1. 



Application to the Daly data set 

The data are derived from the 616 kilobase region of 
human chromosome 5q31 that may contain a genetic 
variant responsible for Crohn's disease as determined 
by genotyping 103 common SNPs (>0.05 minor allele 
frequency) for 129 trios [18]. All offspring belong to 
the case population, while almost all parents belong 
to the control population. In the entire data, there 



are 144 cases and 243 controls. Daly et al. [18] have 
also shown that there are 11 blocks and strong LD 
between the SNPs and their neighboring SNPs in each 
block. 

Model selection and estimation ofHMM parameters 

First, we used the /C-Nearest neighbor method proposed 
by the R package [29] to impute the missing genotypes 




- RSPLIS 

- PLIS 




0.18 
0.16 
0.14 



0.12^ 



- RSPLIS 

- PLIS 




Figure 2 Shows the FNR of PUS is much higher than that of RSPLIS for four different parameter settings in Case 1 . 
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v^=0.5 



Do=1.5 



\)o=1.0 




Do=2.0 




Figure 3 Shows the FDR levels of the two methods are controlled at 0.1 0 asymptotically for four different parameter settings in Case 2. 



from the Daly et al data [18]. Then, for each SNP, we 
obtained a value by conducting a x^'test to assess 
associations between the allele frequencies and the dis- 
ease status, furthermore, we get z-value by transforming 
P'Vdlue, We assumed that the null distribution is standard 
normal N{0, 1) and the non-null distribution is a normal 

mixture ^ §/A/'(/x^J% a^^^- ), where c = 1 because there is 
only one chromosome in the Daly et al. data [18]. We used 



our ACP method to select the number of components 
and change points, where the parameters in HMMs were 
estimated by the EM algorithm. Thereafter, RSPLIS was 
used for multiple testing. 

Data analysis 

Because the Daly et al. data [18] have only 103 SNPs, 
we assumed only one change point for these data in our 
analysis. Thus, we took Lmin = 20, nimax = 6, /z = 2 and 



- RSPLIS 

- PLIS 




0.5 1 1.5 2 



- RSPLIS 

- PLIS 




z 0.15 



- RSPLIS 

- PLIS 




0.5 1 1.5 2 

„(1) 



RSPLIS 
PLIS 




^^11 ^^11 ^^11 

Figure 4 Shows the FNR of PUS is much higher than that of RSPLIS for four different parameter settings in Case 2. 



0.5 1 1.5 2 
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- PLIS 

- RSPLIS 



0.4 0.6 
FDR 



0.8 




- PLIS 

- RSPLIS 



0.2 



0.4 0.6 
FDR 





0.2 



0.4 0.6 
FDR 



0.4 0.6 
FDR 



- PLIS 

- RSPLIS 



0.8 



- PLIS 

- RSPLIS 



Figure 5 Shows the ranking efficiency for = 0 in Case 1 : RSPLIS has higher sensitivity than PLIS at the same FDR level; there is a more 
dramatic improvement when the signals are weak {fi^^^ is small). 



took Kmax = 1 for our ACP method. For the purpose of 
comparison, we also used the PLIS method to analyze the 
Daly et al. data [18]. The data [18] were collected to iden- 
tify genetic variants conferring susceptibility to Crohn's 
disease and nine SNPs were identified [30]. For the pur- 
pose of illustration, we only list here the LIS statistics 
and LIS ranks for the nine casual SNPs (Table 6). Based 



on the definition of LIS statistic given in Section 'Pooled 
FDR control procedure for different chromosomes with 
multiple regions', it is obvious that the smaller value of 
the LIS statistic means a larger probability that this SNP is 
associated with the disease. From Table 6, the rankings for 
eight causal SNPs illustrate that RSPLIS offers a marked 
improvement over PLIS, with the exception of locus 73. It 




- PLIS 

- RSPLIS 



0.4 0.6 
FDR 




0.4 0.6 
FDR 




FDR FDR 

Figure 6 Shows the ranking efficiency for = 0 in Case 2: RSPLIS has higher sensitivity than PLIS at the same FDR level; there is a more 
dramatic improvement when the signals are weak {^^^^ is small). 
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"O 50 100 150 200 250 300 350 400 450 

TOP M SNPs 



Figure 7 The four SNPs in Chr 1 are far away from each other, 
while the four in Chr 2 are separated by only three SNPs. We 

calculated the sensitivity selected by the top M SNPs, ranked by 
RSPLISand by PLIS. 



is not surprising that not all of the nine causal SNPs are top 
ranked because non-casual SNPs that are strongly linked 
to the casual SNP may also be top-ranked. In summary, 
we can see that our method not only makes better rank- 
ings but also has smaller values for LIS statistics for most 
of the true casual SNPs. In addition, Table 6 shows that the 
LIS values obtained using RSPLIS are far lower than those 
obtained using PLIS. The reason may be that each region 
found by the ACP method has a smaller sample size for 
statistical inference in HMM, so this may affect the values 
obtained for the LIS statistic. 

Discussion 

Large-scale multiple testing under dependence is holding 
promise in identifying genetic variants for GWAS. Previ- 
ous research has focused on large-scale multiple testing 
under a HMM for a single chromosome ([14,15]). In the 
present paper, we extended chromosome-specific PLIS 



Table 6 Results of PLIS and RSPLIS for the known 9 casual 
SNPs in Daly data, which were reported as significant 
(Rioux etal., 2001) 

SNP name SNP location LIS statistics LIS ranks 





(th) 


RSPLIS 


PLIS 


RSPLIS 


PLIS 


IGR2055a-l 


25 


1.5994e-18 


7.541 le-04 


5 


7 


IGR2060a-l 


26 


1.7388e-18 


8.5125e-04 


7 


9 


IGR2063b-l 


27 


1.7389e-18 


8.4605e-04 


6 


8 


IGR2096a-l 


33 


8.0144e-18 


4.1287e-03 


39 


51 


IGR2198a-l 


38 


1.0731e-17 


5.3388e-03 


56 


79 


IGR2230a-l 


48 


3.4931 e-1 8 


2.1243e-03 


11 


13 


IGR3081a-l 


73 


7.6344e-18 


3.9276e-03 


37 


34 


IGR3096a-l 


77 


3.7265e-18 


2.3153e-03 


12 


14 


IGR3236a-l 


92 


1.3554e-18 


5.9574e-04 


3 


5 



to RSPLIS to analyze SNP data arising from large-scale 
GWAS by an adaptive penalized criterion. By dividing the 
whole chromosome into more homogeneous regions and 
conducting the extended pooling dependent testing pro- 
cedure, we showed that the accuracy of a multiple testing 
procedure was improved when there are multiple change 
points along the whole chromosome. The numerical per- 
formances of our RSPLIS procedure were investigated 
using both simulated studies and real data analysis. The 
results showed that RSPLIS is more powerful than PLIS at 
identifying small effects in GWAS. 

However, our method could be improved in several 
ways. In the present paper, we conducted large-scale mul- 
tiple testing under a special form of dependence (HMM) 
for the hypotheses. Because complex LD structure(s) are 
usually stored in SNP data, the Markov chain may not be 
the most appropriate model for SNP dependence. There- 
fore, general forms of dependence such as the Markov 
random field should be considered in future, where the 
whole network is divided into a region-specific Markov 
random field network; this would improve the screening 
efficiency in GWAS. 

Besides, the question of how to select an ideal candidate 
change point set is one issue that needs further consider- 
ation. Clearly, better prior knowledge can help us find the 
change point set to reduce the space of competing models 
in the model selection procedure. Thus, a better algo- 
rithm needs to be developed by using prior information to 
obtain the candidate change point set. 

The computational complexity and feasibility of our 
RSPLIS approach for analyzing GWAS data that contain 
tens of thousands of SNPs merit further discussion. The 
RSPLIS method is made up of three independent pro- 
cedures: a procedure for getting the candidate change 
point set, the adaptive criterion-based model selection 
procedure with HMM parameter estimations, and the 
pooled FDR control procedure for all the chromosomes 
with multiple regions, where the second procedure is the 
most time consuming. Fortunately, our method runs the 
second procedure chromosome-by-chromosome, which 
facilitates parallel computing. For each chromosome, say, 
the c-th chromosome, the computational complexity for 
three procedures is 0(1^.^1,), 0(STKl^^mmax\w%c). 
and 0(Lc log(Lc)) respectively, where Lc denotes the num- 
ber of SNPs on the c-th chromosome, \w^\ denotes the 
number of the change points in candidate change point 
set w^, T = 50 in our Algorithm 2, and rrimax is usu- 
ally chosen between four and six [15]. In addition, our 
method is very flexible, which allows users to set the min- 
imum length value of block {Lyyiin) as well as the maximum 
value of the number of change points (Kmax) in large-scale 
GWAS. Based on our simulation studies, with the setting, 
Lmin = 300, Kmax = 5, T = 50, and rrimax = 6, it took 
about 40 min for our RSPLIS procedure to analyze 8000 
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SNPs from two chromosomes. We expect that the running 
time for large-scale GWAS is still acceptable because we 
can use parallel computing for each chromosome. 

Conclusions 

In this paper, we first modeled the observed dependent 
SNP data via region-specific multiple HMMs divided by 
change points, where we developed a novel data-driven 
penalized criterion combined with the DP algorithm 
to find change points. Second, we proposed a RSPLIS 
method to conduct the dependent tests from multiple 
chromosomes with different regions for GWAS. Finally, 
we have shown the numerical performances of the RSPLIS 
procedure using both simulated studies and analysis of a 
real data set. 

Availability 

Matlab and R code for RSPLIS can be accessed at http:// 
math.nenu.edu.cn/faculty/wszhu/softwares/RSPLIS. 
html. This site contains the program files and code 
introduction. 
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Additional file 1 : The derivation of the dynamic program algorithm 
and the proof of theorem 1 . Additional file 1 contains the derivation of 
the dynamic program (DP) algorithm for the step 2 in Algorithm 1 , the 
derivation of the RSPLIS procedure and proof of theorem 1. 
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